Data is just a collection of numbers until you turn it into a story.

About

Cardiovascular diseases are the number one cause of death globally: more people die annually from heart diseases than from any other cause. In 2016 alone, an estimated 17.9 million people died from heart diseases, representing 31% of all global deaths.

Early detection of cardiac diseases and continuous supervision of patients can reduce this mortality rate. However, accurate detection of heart disease in every case and round-the-clock supervision of each patient by a doctor are simply not feasible, given the shortage of specialized doctors.

The Statlog dataset is a heart disease database that was donated by the Cleveland Clinic Foundation, Ohio, in 1988 in the hope of encouraging people to come up with models that can predict heart disease. It contains 13 attributes and a target variable, and the goal is to predict the presence or absence of heart disease in the patient (the target variable).

The original dataset used numeric codes as the values for almost all of the attributes. These were substituted with their meanings during the data preparation stage to make the dataset more intuitive.
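The recoding step itself is not shown in this report. As a minimal sketch of how it might look in R, assume the raw chest pain codes run from 1 to 4 in the same order as the factor levels that appear in the structure and summary output further down. The vector `cp_codes` is a hypothetical stand-in for the raw column.

```r
# Hypothetical raw codes for chest pain type (assumed to run from 1 to 4)
cp_codes <- c(4, 3, 2, 4, 2)

# Map each code to a readable label; the label order follows the factor
# levels shown in this report (Typical, Atypical, Non-Anginal, Asymptomatic)
CPType <- factor(cp_codes,
                 levels = 1:4,
                 labels = c("Typical", "Atypical", "Non-Anginal", "Asymptomatic"))

print(CPType)
```

Using `factor()` with explicit `levels` and `labels` keeps the underlying integer codes intact, so the original coding can still be recovered with `as.integer(CPType)`.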

So let’s have a glimpse at the dataset:

Looks good, but what do these attributes and their values mean? And which attributes are more critical in determining the patient’s heart condition?

Let us dive a little bit deeper and understand the dataset.

Structure of the Dataset

## 'data.frame':    270 obs. of  14 variables:
##  $ age           : num  70 67 57 64 74 65 56 59 60 63 ...
##  $ sex           : Factor w/ 2 levels "F","M": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CPType        : Factor w/ 4 levels "Typical","Atypical",..: 4 3 2 4 2 4 3 4 4 4 ...
##  $ restBP        : num  130 115 124 128 120 120 130 110 140 150 ...
##  $ cholesterol   : num  322 564 261 263 269 177 256 239 293 407 ...
##  $ fastingBS_g120: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 2 1 1 1 ...
##  $ restECG       : Factor w/ 3 levels "0","1","2": 3 3 1 1 3 1 3 3 3 3 ...
##  $ maxHR         : num  109 160 141 105 121 140 142 142 170 154 ...
##  $ ExIndAngina   : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 2 1 1 ...
##  $ oldPeak       : num  2.4 1.6 0.3 0.2 0.2 0.4 0.6 1.2 1.2 4 ...
##  $ slopePEST     : Factor w/ 3 levels "Up","Flat","Down": 2 2 1 2 1 1 2 2 2 2 ...
##  $ nVessels      : Factor w/ 4 levels "0","1","2","3": 4 1 1 2 2 1 2 2 3 4 ...
##  $ thalD         : Factor w/ 3 levels "normal","fixed",..: 1 3 3 3 1 3 2 3 3 3 ...
##  $ HeartDisease  : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 2 2 ...

Attribute Description

  1. Age: Age (in years) of the patient at the time of admission in the hospital

  2. Sex: Gender of the patient (Female/Male)

  3. Chest Pain Type: Broadly classified into typical angina, atypical angina, non-anginal pain and asymptomatic pain.
    • Typical angina is the discomfort that may feel like a tightness or heaviness in the central chest that is noted when the heart does not get enough blood or oxygen.
    • Atypical chest pain describes discomfort or pain centered in the chest that is not cardiac in origin. It does not have a burning quality and is instead a sharp, knife-like, pulsating sensation.
    • Non-anginal pain can be attributed to cervical root compression or esophageal spasm. These are the greatest mimics of angina, since both can be relieved by nitroglycerin, but they have several features that help to rule out angina.
    • Asymptomatic means neither causing nor exhibiting symptoms of disease. This is when the patient doesn’t show symptoms of chest pain.
  4. Resting blood pressure: Measured in mm Hg on admission to the hospital.
    • Readings over 120/80mm Hg and up to 139/89mm Hg are in the normal to high-normal range.
    • According to doctors, blood pressure that stays high over a long time is one of the main risk factors for heart disease.
  5. Cholesterol: Measured in mg/dL at the time of admission.
    • Total cholesterol levels less than 200 milligrams per deciliter (mg/dL) are considered desirable for adults.
    • A reading between 200 and 239 mg/dL is considered borderline high.
    • A reading of 240 mg/dL and above is considered high.
  6. Fasting blood sugar: Whether it is greater than 120 mg/dL (Yes/No)
    • A fasting blood sugar level less than 100 mg/dL (5.6 mmol/L) is normal.
    • A fasting blood sugar level from 100 to 125 mg/dL (5.6 to 6.9 mmol/L) is considered prediabetes.
    • If it is higher than that on two separate tests, the patient has diabetes.

    Over time, high blood glucose from diabetes can damage blood vessels and the nerves that control the heart and blood vessels. The longer one has diabetes, the higher the chances of developing heart disease. In adults with diabetes, the most common causes of death are heart disease and stroke.

  7. Resting ECG Results: An Electrocardiogram (ECG) is a medical test that detects heart problems by measuring the electrical activity generated by the heart as it contracts. In layman’s terms, it is a Voltage Vs. Time graph for our heart. The results are categorized into 3 types:
    • Normal (Code 0) results mean that the ECG curve has the characteristic shape of a healthy heart.
    • ST Wave Abnormality (Code 1) means there is a significant difference between the ECG’s ST segment and that of a healthy heart.
    • Left ventricular hypertrophy(Code 2) is enlargement and thickening (hypertrophy) of the walls of your heart’s main pumping chamber (left ventricle). Left ventricular hypertrophy can develop in response to some factor - such as high blood pressure or a heart condition - that causes the left ventricle to work harder.
  8. Maximum heart rate achieved during exercise: Measured in beats per minute (bpm), the maximum heart rate during exercise is the upper limit of what your cardiovascular system can handle during physical activity.

  9. Exercise induced angina: Whether or not exercise induced chest pain in the patient (Yes/No). Everyone, including people in excellent shape, can experience pain in their chest during exercise. The many potential causes range from benign to potentially life-threatening.

  10. Oldpeak: ST depression induced by exercise relative to rest. ST depression refers to a finding on an electrocardiogram, wherein the trace in the ST segment is abnormally low below the baseline. The ST segment represents the isoelectric period when the ventricles are in between depolarization and repolarization. The typical ST segment duration is usually around 80 ms.

  11. SlopePEST: Slope of the peak exercise ST segment (Upsloping/Flat/Downsloping).

  12. nVessels: Number of major vessels (0-3) colored during fluoroscopy.

  13. Thal: A thallium stress test is a nuclear imaging test that shows how well blood flows into your heart while you’re exercising or at rest, and is usually performed in a controlled, clinical environment. The result is one of: normal blood flow, fixed defect, or reversible defect.

  14. Heart Disease: Target Variable (Yes or No)

Missing values in the dataset

##            age            sex         CPType         restBP    cholesterol 
##              0              0              0              0              0 
## fastingBS_g120        restECG          maxHR    ExIndAngina        oldPeak 
##              0              0              0              0              0 
##      slopePEST       nVessels          thalD   HeartDisease 
##              0              0              0              0

Fortunately, no missing values are present in our dataset.
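The call that produced the table above is not shown. One common way to get a per-column count of missing values in R is `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`toy`, with deliberately inserted `NA`s) rather than the heart dataset, so the counts here are illustrative only.

```r
# Toy data frame standing in for the heart dataset (values are made up)
toy <- data.frame(
  age    = c(70, 67, NA, 64),
  restBP = c(130, 115, 124, NA),
  sex    = factor(c("M", "F", "M", "M"))
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# TRUE only if no column contains a missing value
no_missing <- all(missing_per_column == 0)
```

For the heart dataset every count is zero, so the equivalent of `no_missing` would be `TRUE`.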

Attribute Summary

##       age        sex              CPType        restBP     
##  Min.   :29.00   F: 87   Typical     : 20   Min.   : 94.0  
##  1st Qu.:48.00   M:183   Atypical    : 42   1st Qu.:120.0  
##  Median :55.00           Non-Anginal : 79   Median :130.0  
##  Mean   :54.43           Asymptomatic:129   Mean   :131.3  
##  3rd Qu.:61.00                              3rd Qu.:140.0  
##  Max.   :77.00                              Max.   :200.0  
##   cholesterol    fastingBS_g120 restECG     maxHR       ExIndAngina
##  Min.   :126.0   FALSE:230      0:131   Min.   : 71.0   No :181    
##  1st Qu.:213.0   TRUE : 40      1:  2   1st Qu.:133.0   Yes: 89    
##  Median :245.0                  2:137   Median :153.5              
##  Mean   :249.7                          Mean   :149.7              
##  3rd Qu.:280.0                          3rd Qu.:166.0              
##  Max.   :564.0                          Max.   :202.0              
##     oldPeak     slopePEST  nVessels        thalD     HeartDisease
##  Min.   :0.00   Up  :130   0:160    normal    :152   No :150     
##  1st Qu.:0.00   Flat:122   1: 58    fixed     : 14   Yes:120     
##  Median :0.80   Down: 18   2: 33    reversible:104               
##  Mean   :1.05              3: 19                                 
##  3rd Qu.:1.60                                                    
##  Max.   :6.20
  1. Age:
    • The patients’ ages in our sample range from 29 to 77.
    • About a quarter of the patients are younger than 48 years; the remaining 75 percent are aged 48 or older.
    • 50 percent of the patients in our sample are aged between 48 and 61 years, which shows that the observations generally lie close to the central value.
    • The average age of patients is 54.4 years and the median age is 55 years, which are pretty close.
  2. Gender:
    • The number of men in our sample is 183.
    • The number of women in our sample is 87, which is roughly half the number of men.
  3. Chest Pain Type:
    • The most common type of chest pain reported by the patients is Asymptomatic, which is experienced by 129 patients.
    • Only 20 patients reported Typical Chest Pain.
    • Atypical and Non-anginal Pain was reported by 42 and 79 patients respectively.
  4. Resting blood pressure:
    • The minimum Resting Blood Pressure in our sample is 94 mm Hg.
    • The maximum Resting Blood Pressure in our sample is 200 mm Hg.
    • The average Resting Blood Pressure of patients is 131 mm Hg, which is close to the median value of 130 mm Hg.
    • The interquartile range (25th to 75th percentile) coincides with the normal to high-normal range of resting blood pressure, which is 120-140 mm Hg.
  5. Cholesterol:
    • The minimum blood cholesterol level in our sample is 126 mg/dL.
    • The maximum blood cholesterol level in our sample is 564 mg/dL.
    • Both the mean and median values, which are 250 and 245 mg/dL respectively, fall in the high range (240 mg/dL and above) and are well above the desirable level for a healthy adult.
    • The first quartile is 213 mg/dL, which is already above the 200 mg/dL threshold, so fewer than 25 percent of the patients have desirable cholesterol. At least 75 percent are borderline high or high.
  6. Fasting blood sugar:
    • 40 out of the 270 patients have a fasting blood sugar greater than 120 mg/dL.
    • The remaining 230 patients have a fasting blood sugar of 120 mg/dL or lower.
  7. Resting ECG Results:
    • 137 out of the 270 patients were reported to have Left ventricular hypertrophy(Code 2).
    • 131 patients had normal ECG results.
    • 2 Patients were reported to have ST Wave Abnormality(Code 1).
  8. Maximum heart Rate achieved during exercise:
    • The highest maximum heart rate achieved by any patient in our sample was 202 bpm.
    • The lowest maximum heart rate achieved during exercise in our sample was just 71 bpm.
    • The average maximum heart rate during exercise in our sample was around 150 bpm, whereas the median value was 153.5 bpm.
  9. Exercise induced angina:
    • 181 patients didn’t feel any chest pain during exercise.
    • The other 89 patients experienced exercise-induced angina (chest pain).
  10. Oldpeak: The ST depression induced by exercise relative to rest.
    • The minimum value is 0.
    • The maximum value is 6.20.
    • The average value is 1.05, whereas the median value is 0.80.
    • At least 25 percent of patients have a value of 0, since the first quartile is 0.00.
  11. SlopePEST:
    • 130 patients had an upward slope of their peak exercise ST segment.
    • 122 patients had a flat ST segment during peak exercise.
    • 18 patients had a downward slope of the ST segment during peak exercise.
  12. nVessels: Number of major vessels (0-3) colored during fluoroscopy
    • 160 patients had no major vessels colored during fluoroscopy.
    • 58 patients showed coloration in one major blood vessel during fluoroscopy.
    • 33 and 19 patients showed coloration in two and three major blood vessels respectively during fluoroscopy.
  13. Thal: Thallium stress test results(Normal blood flow/Fixed defect/Reversible defect)
    • 152 patients had normal blood flow to the heart during the Thallium Stress Test.
    • 104 patients showed reversible defects in blood flow, i.e. defects that returned to normal with rest.
    • 14 patients either had fixed defects or developed fixed defects in blood flow to the heart after the Thallium Stress Test.
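The quartile-based statements in the list above come straight from the summary table. As a small illustration of how such figures can be computed in R, the sketch below applies `quantile()` to the first ten ages shown in the structure output. These ten values are only a convenient sample, so their quartiles differ from those of the full 270-patient dataset.

```r
# First ten ages from the str() output, used purely as a small illustration
age <- c(70, 67, 57, 64, 74, 65, 56, 59, 60, 63)

# Minimum, quartiles and maximum (R's default quantile type)
five_num <- quantile(age, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)

# Width of the middle 50 percent of the values
iqr_width <- unname(five_num["75%"] - five_num["25%"])
```

The same numbers, plus the mean, are what `summary()` prints for a numeric column.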

Let us now study each attribute with plots and graphs. But first, let’s look at the proportion of patients that had heart disease and the proportion that didn’t:

From the above graph it is clear that more than half of the sample (55.6%) does not have heart disease, whereas 44.4% of the patients do.
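These percentages follow directly from the class counts in the attribute summary (150 patients without and 120 with heart disease). A minimal sketch of the arithmetic in R, where `counts` is simply those two numbers typed in by hand:

```r
# Class counts taken from the attribute summary (HeartDisease: No / Yes)
counts <- c(No = 150, Yes = 120)

# Share of each class as a percentage, rounded to one decimal place
percent <- round(100 * counts / sum(counts), 1)
print(percent)
```

With the full data frame available, `prop.table(table(...))` on the target column would give the same proportions.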

Were there any telltale symptoms, or causes for these heart diseases?

Is there a way that one can accurately classify the patients based on the preliminary reports?

Let’s explore the features to get insights from the dataset and answer the above questions.

Univariate Analysis

Let us take each attribute of our dataset one by one and look further into their distributions rather than just looking at their summary statistics.

1. Age of Patients

With a skewness of -0.16 and a mean age close to the median, this looks like a fairly symmetrical, approximately normal distribution with a slight left skew.

From the attribute summary, we know that only about a quarter of the patients are aged below 48 years.

We also saw that 50 percent of our observations are aged between 48 and 61 years.
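This report does not say which formula or package produced the skewness values it quotes. Base R has no built-in skewness function, so the sketch below defines a hypothetical helper, `skewness()`, using the common moment-based definition (third central moment divided by the cubed standard deviation). Other definitions apply small-sample corrections and give slightly different numbers.

```r
# Moment-based sample skewness (an assumed definition, not necessarily
# the one used for the figures quoted in this report)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m  <- mean(x)
  m2 <- mean((x - m)^2)  # second central moment
  m3 <- mean((x - m)^3)  # third central moment
  m3 / m2^(3 / 2)
}

symmetric  <- c(1, 2, 3, 4, 5)      # no skew
right_tail <- c(1, 1, 2, 2, 3, 10)  # long tail towards high values

skewness(symmetric)    # zero for a symmetric sample
skewness(right_tail)   # positive: right skew
skewness(-right_tail)  # negative: left skew
```

A negative value, as for age, means the longer tail is on the left; a positive value, as for resting blood pressure or cholesterol, means it is on the right.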

2. Gender of Patients

The sample contains roughly half as many female patients as male patients (87 versus 183).

3. Chest Pain Type

The most common category, reported by 47.8 percent of patients, is Asymptomatic, which technically is not chest pain at all, unlike Typical and Atypical Angina, reported by 7.4 and 15.6 percent of patients respectively. Interestingly, the remaining 29.3 percent of patients report Non-Anginal pain, meaning chest pain that does not originate from the heart. While we usually connect chest pain with heart disease, this offers a rather different perspective on the types of pain reported by the patients in our sample.

4. Resting blood pressure

With a skewness value of +0.71, the distribution is skewed towards the right.

From the attribute summary, we know that 75 percent of the patients have a resting blood pressure of 140 mm Hg or below, which falls in the normal to high-normal range and can be seen in the histogram.

The remaining 25 percent of the patients have considerably high resting blood pressure.

According to doctors, blood pressure that stays high over a long time is one of the main risk factors for heart disease, and we shall look further into this in the bivariate analysis.

5. Cholesterol

Although the calculated skewness value of +1.17 is very high, the distribution looks fairly symmetrical around the median, with a handful of outliers (cholesterol >= 400 mg/dL).

From our attribute description, a desirable cholesterol level for adults is less than 200 mg/dL. From our summary, the first quartile is already 213 mg/dL, so fewer than 25 percent of our sample falls in the desirable range, and the median of 245 mg/dL means at least half of the patients are in the high range (240 mg/dL and above). Hence, this may not be a good criterion for classifying patients for heart disease, as almost every patient has elevated cholesterol. Still, we’ll look further into its significance in our bivariate analysis.
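To see how patients fall into the three bands from the attribute description, the readings can be binned with `cut()`. The sketch below uses only the first ten cholesterol values shown in the structure output, so the counts are illustrative and not the distribution of the full sample. The band labels are taken from the attribute description.

```r
# First ten cholesterol readings (mg/dL) from the str() output
chol <- c(322, 564, 261, 263, 269, 177, 256, 239, 293, 407)

# right = FALSE makes each interval closed on the left:
# [0, 200) desirable, [200, 240) borderline high, [240, Inf) high
chol_band <- cut(chol,
                 breaks = c(0, 200, 240, Inf),
                 labels = c("Desirable", "Borderline high", "High"),
                 right = FALSE)

table(chol_band)
```

Even in this tiny sample, eight of the ten readings land in the high band, which matches the picture given by the quartiles.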

6. Fasting blood sugar

This is good: only 14.8 percent of the patients have a fasting blood sugar greater than 120 mg/dL, whereas the remaining 85.2 percent have a fasting blood sugar of 120 mg/dL or lower.

7. Resting ECG Results

The ECG results show that 50.7 percent of the patients have left ventricular hypertrophy, which is enlargement and thickening of the walls of the heart’s main pumping chamber (left ventricle). Another 48.5 percent of patients got normal ECG results and 0.7 percent (2 patients) reported ST wave abnormality.

8. Maximum Heart Rate Achieved during exercise

With a skewness value of -0.52, there is a moderate left skewness in the distribution of maximum heart rate achieved during exercise.

The spread of the first quartile alone (71-133 bpm, a width of 62 bpm) is nearly as wide as the combined spread of the remaining 75 percent of the data (133-202 bpm, a width of 69 bpm).

9. Exercise induced angina

67 percent of patients didn’t experience any chest pain during exercise, whereas 33 percent did. This may be due to the weaker heart condition of the latter or some other underlying factors.

10. ST depression induced by Exercise (relative to rest measurement)

With a skewness of +1.25, there is considerable right skewness, and the distribution of ST depression is far from symmetrical.

An ST depression value of zero for a lot of patients means that there wasn’t a considerable difference in the ST segment of those patients before and after exercise. However, there are still a lot of patients who had a significant change in their ST segment after exercise.

11. SlopePEST

The most common slope of the peak exercise ST segment is upward, reported by 48.1 percent of patients, whereas 45.2 percent of patients had a flat ST segment during exercise. The remaining 6.7 percent of patients had a downward slope in their ST segment.

12. nVessels

59.3 percent of patients had no major vessels colored during fluoroscopy, and 21.5 percent showed coloration in one major blood vessel. A further 12.2 percent and 7 percent of patients showed coloration in two and three major blood vessels respectively.

This shows that a majority of patients (around 80 percent) had either no blockages in their major heart vessels, or had blockage in one major heart vessel.

13. Thallium Stress Test

56.3 percent of patients had normal Thallium stress test reports, meaning that these patients had normal blood flow to the heart during exercise.

Another 38.5 percent of the patients showed reversible defects in blood flow, i.e. defects that returned to normal with rest.

The remaining 5.2 percent of patients either had fixed defects or developed fixed defects in blood flow to the heart after the Thallium Stress Test.

Summary of Univariate Analysis

Let us jot down the most significant points we came across in our univariate analysis:

  • Patients undergoing diagnosis of heart disease are more likely to be 48 years or older (about 75 percent of the sample).
  • A majority of the patients have elevated cholesterol: at least 75 percent are above the desirable level of 200 mg/dL, and at least half are in the high range (240 mg/dL and above).
  • Only a minority (14.8 percent, or 40 out of the total 270) of the patients have high blood sugar.
  • Only 2 patients have ST-T wave abnormality in their ECG results, whereas the remaining patients have either left ventricular hypertrophy or normal reports (roughly 50 percent each).
  • Maximum heart rate achieved during exercise is correlated to the Age attribute in our dataset. We will look into this in our multivariate analysis.
  • The ST depression induced by exercise (OldPeak) attribute was hard to interpret on its own, since the value by itself gives no insight into why so many patients show no ST depression, or how it is related to the slope of the peak exercise ST segment. We will look into this in our multivariate analysis.
  • A majority of patients (around 80 percent) had either no blockages in their major heart vessels, or had a blockage in one major heart vessel during rest. The case could be different under stress.
  • Around 40 percent of patients showed reversible defects and 5 percent showed fixed defects in the blood flow during Thallium stress test, which might be due to the partial blockage of the blood vessels. It might be the case that the blockage shows its effects only when the patient is stressed and needs more blood to be pumped from the heart.

Bivariate Analysis

With the summary of univariate analysis ready, let us now focus on our target variable (Heart Disease) and see how our attributes are able to explain it.

1. Gender wise distribution of Heart Disease

55 percent of the men in our sample have heart disease, which is far greater than the 23 percent of the females having heart disease. This seems to indicate that the gender of the patient is related to the Heart Disease, so let’s confirm that with a Hypothesis test:
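The output of that test is not reproduced in this report. As a sketch of one way to run it, a chi-squared test of independence can be applied to the 2x2 table of sex against heart disease. The cell counts below are not given in the source: they are an approximate reconstruction from the quoted percentages (55 percent of 183 men, 23 percent of 87 women) and the class totals (120 with and 150 without heart disease), so treat them as an assumption.

```r
# Approximate counts reconstructed from the percentages in the text
# (an assumption; the exact cross-tabulation is not shown in the source)
sex_by_hd <- matrix(c(67,  20,
                      83, 100),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(sex = c("F", "M"),
                                    HeartDisease = c("No", "Yes")))

# Row-wise percentages: share of each sex with and without heart disease
row_pct <- round(100 * prop.table(sex_by_hd, margin = 1), 1)
print(row_pct)

# Chi-squared test of independence (applies Yates' continuity
# correction by default for 2x2 tables)
test <- chisq.test(sex_by_hd)
print(test)
```

A p-value below the 0.05 significance level would support the idea that sex and heart disease are related in this sample.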

2. Analysis of Age according to presence or absence of Heart Disease.

The age distribution of patients having heart disease seems to be different than the patients not having heart disease, so let’s confirm that with a Hypothesis test:

Does old age make you more susceptible to Heart Disease?

Let us test this using a Hypothesis Test, with a significance level of 0.05

  • Null Hypothesis, H0: The average age of patients with heart disease is similar to that of patients without heart disease.
  • Alternate Hypothesis, H1: The average age of patients with heart disease is higher than that of patients without heart disease.
## 
##  Two Sample t-test
## 
## data:  age_HeartDisease and age_NoHeartDisease
## t = 3.557, df = 268, p-value = 0.0002217
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.08222     Inf
## sample estimates:
## mean of x mean of y 
##  56.59167  52.70667

Indeed, the average age of patients with heart disease is higher than that of patients without heart disease, and our null hypothesis can be rejected, with a p-value of 0.0002, which is much less than our significance level of 0.05.

Older patients are more likely to be diagnosed positive for heart disease. After all, our heart is a muscle too, and with age it becomes weaker.
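The output above is consistent with a one-sided, pooled-variance t-test, since the reported degrees of freedom are 268 = 120 + 150 - 2. The sketch below shows what such a call might look like and reproduces the statistic by hand. The ages are simulated (the group means echo the output, while the standard deviations of 8 and 9 are made up), so the printed numbers will not match the report.

```r
set.seed(1)

# Simulated ages only; these are NOT the patients' actual values
age_hd    <- rnorm(120, mean = 56.6, sd = 8)  # with heart disease
age_no_hd <- rnorm(150, mean = 52.7, sd = 9)  # without heart disease

# One-sided two-sample t-test assuming equal variances
res <- t.test(age_hd, age_no_hd, alternative = "greater", var.equal = TRUE)
print(res)

# The same statistic by hand: pooled variance, then the difference in
# means divided by its standard error
n1 <- length(age_hd)
n2 <- length(age_no_hd)
sp2 <- ((n1 - 1) * var(age_hd) + (n2 - 1) * var(age_no_hd)) / (n1 + n2 - 2)
t_manual <- (mean(age_hd) - mean(age_no_hd)) / sqrt(sp2 * (1 / n1 + 1 / n2))
```

Dropping `var.equal = TRUE` would switch R to Welch's test, which does not assume equal variances and reports fractional degrees of freedom.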

3. Analysis of Chest Pain based on presence of Heart Disease

71 percent of the patients in the asymptomatic category had heart disease. This is concerning because many heart diseases do not display any symptoms until they cross some threshold, at which point the symptoms become immediately evident; until then they are called asymptomatic.

Although we can show with a hypothesis test that the type of chest pain is significantly related to heart disease, we don’t have any further information about the causes and signs of the asymptomatic category. So while it would help our model to classify heart disease, there would be no real-world intuition behind this value of the Chest Pain attribute.

On the other hand, typical chest pain, which can be easily diagnosed even by a non-specialist, was accompanied by heart disease in 25 percent of the patients who reported it. So, it shouldn’t be taken lightly either.

4. Resting blood pressure based on presence of Heart Disease

While the median value of the resting blood pressure is similar for patients with and without Heart Disease, there is a considerable spread in the values for the latter towards higher values. Let us confirm that with a Hypothesis test.

Do patients with Heart disease generally have Higher Blood Pressure?

  • Null Hypothesis, H0: The average Resting Blood Pressure of patients with heart disease is similar to that of patients without heart disease.
  • Alternate Hypothesis, H1: The average Resting Blood Pressure of patients with heart disease is higher than that of patients without heart disease.

We will use a t-test for this, with a significance level of 0.05.

## 
##  Two Sample t-test
## 
## data:  BP1 and BP2
## t = 2.575, df = 268, p-value = 0.00528
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.001458      Inf
## sample estimates:
## mean of x mean of y 
##  134.4417  128.8667

Indeed, the average resting blood pressure of patients with heart disease is higher than that of patients without heart disease, and our null hypothesis can be rejected, with a p-value of 0.005, which is much less than our significance level.

Patients with heart disease may have weaker hearts because the heart has to work harder even during rest due to blockages in blood vessels, which puts more pressure on the arteries. Higher resting blood pressure, thus, should not be taken lightly.
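The one-sided confidence bound in the output can be recovered from the other printed numbers, which is a useful sanity check on how the pieces of a t-test fit together. The sketch below uses only values copied from the output above (group means, t statistic, degrees of freedom) and derives the standard error as the difference in means divided by t. Because the inputs are rounded, the result agrees with the printed bound only approximately.

```r
# Values copied from the resting blood pressure t-test output
mean_hd    <- 134.4417
mean_no_hd <- 128.8667
t_stat     <- 2.575
df         <- 268

diff_means <- mean_hd - mean_no_hd  # observed difference in means
se         <- diff_means / t_stat   # standard error implied by t

# Lower bound of the one-sided 95 percent confidence interval
lower <- diff_means - qt(0.95, df) * se
print(lower)
```

Since the lower bound is above zero, the interval excludes a difference of zero, which is the same conclusion the p-value gives.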

5. Analysis of Cholesterol levels based on presence of Heart Disease

From our univariate analysis, we concluded that a majority of the patients (at least 75 percent) have cholesterol above the desirable level, and the box plot depicts similar distributions of cholesterol levels in patients with and without heart disease.

There doesn’t seem to be significant difference between the two groups, but let’s confirm that with a Hypothesis test:

Do patients with Heart disease generally have Higher cholesterol?

  • Null Hypothesis, H0: The average cholesterol level of patients with heart disease is similar to that of patients without heart disease.
  • Alternate Hypothesis, H1: The average cholesterol level of patients with heart disease is higher than that of patients without heart disease.

We will use a t-test for this, with a significance level of 0.05.

## 
##  Two Sample t-test
## 
## data:  Ch1 and Ch2
## t = 1.9457, df = 268, p-value = 0.05274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1459672 24.6526339
## sample estimates:
## mean of x mean of y 
##  256.4667  244.2133

The p-value of 0.053 is slightly greater than our significance level, so we fail to reject the null hypothesis: this test does not show a significant difference in average cholesterol between patients with and without heart disease. Note that the test printed above is two-sided, whereas H1 as stated is one-sided. The corresponding one-sided p-value would be half as large (about 0.026) and would fall below 0.05, so this result is borderline rather than clear-cut.

Across both groups, average cholesterol levels are higher than normal, which may be due to repeated intake of fast food containing generous amounts of oil and a lack of exercise to reduce it. This is concerning because people are not eating healthy, nutritious diets that could help keep their cholesterol levels in check.
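The relationship between the two-sided p-value in the output and the one-sided p-value for "greater" can be checked from the t statistic and degrees of freedom alone, using the t distribution function `pt()`. The inputs below are copied from the cholesterol output above.

```r
# Values copied from the cholesterol t-test output
t_stat <- 1.9457
df     <- 268

# Two-sided p-value: probability in both tails beyond |t|
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# One-sided p-value for the alternative "greater": upper tail only
p_greater <- pt(t_stat, df, lower.tail = FALSE)

print(c(two_sided = p_two_sided, greater = p_greater))
```

Because the observed difference points in the direction of the alternative, the one-sided p-value is exactly half of the two-sided one. Which of the two is appropriate depends on the hypothesis fixed before looking at the data.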

6. Analysis of Fasting blood sugar based on presence of Heart Disease

From our univariate analysis we found that 85 percent of the patients had a fasting blood sugar of 120 mg/dL or lower, and out of those, 45 percent have heart disease.

On the other hand, out of all the patients having higher blood sugar than 120 mg/dL, 42 percent had heart disease.

7. Comparing ECG Results with presence of Heart Disease

65 percent of patients that had normal ECG results don’t actually have a heart disease.

On the other hand, out of all the patients that reported left ventricular hypertrophy, 53 percent had heart disease. Of the 2 patients with an ST abnormality in their ECG results, 1 (50 percent) had heart disease, which is too small a group to draw conclusions from.

8. Analysis of Maximum Heart Rate Achieved during exercise based on presence of heart disease

Now this one is the most interesting attribute by far, as it clearly distinguishes the patients with and without heart disease based on the fact that a healthy person has much higher ability to push his/her cardiovascular system to the extreme levels, whereas a person showing symptoms of Heart Disease simply can’t.

This is great because it doesn’t require a doctor to conduct tests and tell us that we have a weak heart, rather it can be discovered even by regularly engaging in physical activities and being in touch with our bodies - by not ignoring the subtle signs that our body gives us.

Let’s prove this with a hypothesis test:

Do patients without Heart disease generally have Higher maximum Heart Rate during exercise?

  • Null Hypothesis, H0: The average maximum heart rate of patients with heart disease is similar to that of patients without heart disease.
  • Alternate Hypothesis, H1: The average maximum heart rate of patients without heart disease is higher than that of patients with heart disease.

We will use a t-test for this, with a significance level of 0.05.

## 
##  Two Sample t-test
## 
## data:  Hr1 and Hr2
## t = -7.5438, df = 268, p-value = 7.12e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -24.55777 -14.39223
## sample estimates:
## mean of x mean of y 
##  138.8583  158.3333

Indeed, patients with healthy hearts reach much higher maximum heart rates than patients with heart disease, and our null hypothesis can be rejected, with a p-value that is much smaller than our significance level. (The test printed above is two-sided; the one-sided version matching H1 gives an even smaller p-value.)

9. Analysis of Exercise induced angina based on presence of heart disease

This was foreseeable, as chest pain is one of the most common symptoms of heart disease, and 75 percent of patients experiencing chest pain during exercise actually had heart disease.

Similarly, 70 percent of the patients who didn’t feel any chest pain during exercise didn’t have heart disease. This is in line with the view that exercise is crucial to keeping our heart healthy. We will look further into it in our multivariate analysis.

10. Analysis of Oldpeak attribute with presence of heart disease

The ST depression induced by exercise is starting to make sense now: patients without heart disease have values closer to zero, whereas patients with heart disease have greater values. Let’s confirm it with a hypothesis test:

Do patients without Heart disease generally have lower OldPeak values?

  • Null Hypothesis, H0: The average Old Peak value of patients with heart disease is similar to that of patients without heart disease.
  • Alternate Hypothesis, H1: The average Old Peak value of patients without heart disease is lower than that of patients with heart disease.

We will use a t-test for this, with a significance level of 0.05.

## 
##  Two Sample t-test
## 
## data:  OP1 and OP2
## t = 7.5319, df = 268, p-value = 3.839e-13
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.7507938       Inf
## sample estimates:
## mean of x mean of y 
## 1.5841667 0.6226667

Indeed, patients with healthy hearts have much lower induced ST depression during exercise than patients with heart disease, and our null hypothesis can be rejected, with a p-value that is much smaller than our significance level.

11. Analysis of Peak Exercise ST Segment with presence of Heart Disease

75 percent of patients having upward slope in Peak Exercise ST Segment don’t actually have a heart disease.

On the other hand, out of all the patients that had a flat slope in Peak Exercise ST Segment, 64 percent had heart disease. Whereas, 56 percent of patients having a downward slope in Peak Exercise ST Segment have a heart disease.

12. Analysis of Fluoroscopy results with presence of Heart disease

75 percent of patients having no major vessels colored during fluoroscopy don’t actually have heart disease.

On the other hand, out of all the patients that had 1 major vessel colored, 66 percent had heart disease, while 79 percent of patients that had 2 major vessels colored had heart disease, and 84 percent of patients that had 3 major vessels colored had heart disease.

13. Analysis of Thallium Stress test with presence of Heart Disease

78 percent of patients having normal Thallium Stress test results don’t have heart disease, which speaks to the accuracy of the Thallium Stress test.

On the other hand, out of all the patients that reported reversible defects in blood flow during the test, 76 percent had heart disease.

Of the patients that reported fixed defects in blood flow during the test, 57 percent had heart disease.

Back to Table of Contents

Summary of Bivariate Analysis

Let us now jot down the most significant points we came across in our bivariate analysis:

  • Older patients are much more likely to be diagnosed positive for heart disease. After all, our heart is a muscle too, and with age it becomes weaker, which is true for both men and women.
  • 71 percent of the patients reporting asymptomatic chest pain have heart disease. This is concerning because many heart diseases do not display many symptoms until they cross some threshold, at which point the symptoms become immediately evident and it is too late.
  • Patients with heart disease have weaker hearts because the heart has to work harder even during rest due to partial blockages in blood vessels. Higher resting blood pressure thus should not be taken lightly.
  • The average cholesterol level across all the patients is higher than the normal level, which may be due to repeated intake of fast food containing generous amounts of oil and the lack of exercise to reduce it. This is concerning because people are not eating healthy, nutritious diets that could help keep their cholesterol levels in check.
  • A healthy person has a much higher ability to push his/her cardiovascular system to extreme levels, whereas a person showing symptoms of heart disease simply can’t. As with other muscles of our body, we need to pay attention to our heart and immediately consult a doctor in case of any discomfort during exercise.
  • Patients with healthy hearts have much lower induced ST depression during exercise than patients who have heart disease.

Back to Table of Contents

Multivariate Analysis

Let’s look at our attributes one last time, and try to study the contributing factors to heart disease in multivariate analysis of the data.

Our dataset has 9 categorical variables and 5 continuous variables. The correlation matrix of the latter is as follows:

The attributes in our dataset have low to medium correlation, with the most prominently correlated attributes being:

Also, the most significant attributes that we came across in our univariate and bivariate analysis were:

We will now plot the target variable(Presence/Absence of heart disease) along with the above four attributes.

1. Age, MaxHR, Thallium Test results, Target Variable

Key-
  • x axis: Age of the patient
  • y axis: Maximum Heart Rate achieved during Exercise (Bpm)
  • Faceting: Thallium Stress Test results(Normal/Fixed Defect/Reversible Defect)
  • Color: Heart Disease (Yes/No)

Healthy patients (that don’t have heart disease and also had normal reports in Thallium Stress Test) show a moderate downhill relationship between the Maximum Heart Rate and Age, with patients generally having higher Maximum Heart Rates because their hearts are capable of being pushed to the extreme. They seem to follow the general rule of thumb given by doctors:

Maximum Heart Rate = 220 - Age

Whereas no such relationship can be seen for Patients with Heart Disease.

Although this can’t be seen as a cause of heart disease, it is a pretty good sign for detecting possible heart risks. People that engage in physical exercise on a regular basis would be better able to detect whether their heart is functioning properly or not. And early diagnosis translates to better chances of survival.

2. Age, OldPeak, Thallium Test results, Target Variable

Key-
  • x axis: Age of the patient
  • y axis: OldPeak (Relative ST Depression Induced during Exercise)
  • Faceting: Thallium Stress Test results(Normal/Fixed Defect/Reversible Defect)
  • Color: Heart Disease (Yes/No)

Healthy patients (that don’t have heart disease) generally have no change in their ST Depression before and after Exercise, and even if they do, it is much less than that of patients having heart disease.

When combined with the Thallium stress test results, patients with higher OldPeak values and reversible defects generally have heart disease. This, too, doesn’t seem to be a cause of heart disease, but rather a detection measure. With fitness trackers capable of monitoring ECG, a person can be alerted in case of a sudden change in his/her ST depression and immediately get himself/herself diagnosed. And as always, early diagnosis translates to better chances of survival.

Back to Table of Contents

Conclusion

Summing up our Exploratory Data analysis, we conclude on the following points:

Back to Table of Contents

Predictive Modelling

Predictive analytics is an area of statistics that deals with extracting information from data and using it to predict trends and/or behavioral patterns. We are going to use various Machine Learning algorithms in order to predict whether a patient has heart disease or not, based on the predictor variables in our dataset.

Learning, for a computer, is nothing but encoding information about the environment into the parameters of the model.

Splitting the data

We are splitting our dataset into training data and test data. The training set includes the target variable and the model learns on this data in order to be generalized to other data later on. We have the test dataset in order to test our model’s prediction after the model has been trained. A 70:30 split is what we will use to randomly put patients in training or testing subset.
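
A 70:30 random split can be sketched in base R as below. This is a minimal illustration, not the code used for this report: the seed is arbitrary, `heart` is a hypothetical placeholder for the real data frame, and `test_set` is an assumed name (only `train_set` appears in the model calls further down).

```r
set.seed(123)                          # arbitrary seed, for reproducibility only
n <- 270                               # number of patients in the dataset
heart <- data.frame(id = seq_len(n))   # placeholder for the real data frame

# Draw 70% of the row indices at random; the rest form the test set.
train_idx <- sample(n, size = round(0.7 * n))
train_set <- heart[train_idx, , drop = FALSE]
test_set  <- heart[-train_idx, , drop = FALSE]

nrow(train_set)   # 189 rows
nrow(test_set)    # 81 rows
```

A purely random split like this does not guarantee the same share of heart disease patients in both subsets, which is why the class counts are checked next.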

Checking the number of heart disease patients in each set

Training Set

##  No Yes 
## 108  81

The training subset has 81 patients with heart disease (roughly 43 percent of the training subset) and 108 patients without heart disease, which constitutes the remaining 57 percent of the training subset. The training subset has 189 observations, or rows.

Testing Set

##  No Yes 
##  42  39

The testing subset has 39 patients with heart disease (roughly 48 percent of the testing subset) and 42 patients without heart disease, which constitutes the remaining 52 percent of the testing subset. The testing subset has 81 observations.

With the training and testing subsets ready, we can now apply Machine Learning models and use their accuracy and various other metrics to choose the model that outshines the others for our problem, that is predicting heart disease in patients.

Since ours is a Classification problem, we will use the following algorithms:

  1. Logistic Regression

  2. Decision Trees

  3. Random Forests

Logistic Regression

Logistic Regression, contrary to what its name suggests, is a classification algorithm that is used when the response variable is categorical. The idea of Logistic Regression is to find a relationship between the features and the probability of a particular outcome of the target variable. To be more specific, we are using Binomial Logistic Regression, since our target variable is categorical with only two classes - Yes or No.
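
To make the idea concrete, here is a small sketch on simulated data. Everything in it is an assumption for illustration: the data frame `toy`, the single predictor, and the "true" coefficients are invented, and only the `glm(..., family = "binomial")` call mirrors the models fitted below. The model estimates the probability of the second factor level ("Yes"), which is then turned into a class with a 0.5 cutoff.

```r
set.seed(1)
n <- 200
toy <- data.frame(oldPeak = runif(n, 0, 4))    # hypothetical predictor values
p_true <- plogis(-1.5 + 1.2 * toy$oldPeak)     # assumed true relationship
toy$HeartDisease <- factor(ifelse(runif(n) < p_true, "Yes", "No"),
                           levels = c("No", "Yes"))

# Binomial logistic regression: models log-odds of "Yes" as a linear function.
fit <- glm(HeartDisease ~ oldPeak, family = "binomial", data = toy)

# Predicted probability of "Yes" for each row, then a 0.5 cutoff.
p_hat <- predict(fit, type = "response")
pred  <- factor(ifelse(p_hat > 0.5, "Yes", "No"), levels = c("No", "Yes"))
table(Prediction = pred, Reference = toy$HeartDisease)
```

The 0.5 cutoff is a convention rather than a requirement; a lower cutoff would flag more patients as having heart disease at the cost of more false alarms.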

We will apply Logistic Regression in the following ways and then discuss the implications of each one of them:

  1. Using all predictor variables/features

  2. Using Backward Elimination for selecting important features

  3. Using only the significant features from the EDA

1. Using all predictor variables

## 
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7259  -0.3923  -0.1023   0.2349   2.5820  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -10.271779   4.391845  -2.339 0.019344 *  
## age                 -0.016210   0.031761  -0.510 0.609792    
## sexM                 1.892395   0.807368   2.344 0.019083 *  
## CPTypeAtypical       2.416515   1.188629   2.033 0.042050 *  
## CPTypeNon-Anginal    0.869087   1.067521   0.814 0.415578    
## CPTypeAsymptomatic   3.021381   1.075116   2.810 0.004950 ** 
## restBP               0.037381   0.018785   1.990 0.046589 *  
## cholesterol          0.008521   0.006061   1.406 0.159801    
## fastingBS_g120TRUE  -0.120405   0.705568  -0.171 0.864500    
## restECG1             0.698244   4.779747   0.146 0.883855    
## restECG2             0.972139   0.565877   1.718 0.085809 .  
## maxHR               -0.021902   0.014745  -1.485 0.137440    
## ExIndAnginaYes       0.661797   0.592564   1.117 0.264064    
## oldPeak              0.743121   0.338660   2.194 0.028214 *  
## slopePESTFlat        1.025289   0.628569   1.631 0.102859    
## slopePESTDown        0.089005   1.347028   0.066 0.947318    
## nVessels1            2.582653   0.676253   3.819 0.000134 ***
## nVessels2            3.371001   0.979360   3.442 0.000577 ***
## nVessels3            2.702779   1.036139   2.609 0.009094 ** 
## thalDfixed          -0.510212   1.053455  -0.484 0.628157    
## thalDreversible      0.941636   0.646732   1.456 0.145395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 258.14  on 188  degrees of freedom
## Residual deviance: 105.74  on 168  degrees of freedom
## AIC: 147.74
## 
## Number of Fisher Scoring iterations: 6
  • Each non-reference category of a categorical variable is treated as a separate binary (dummy) variable by default, with the first level serving as the baseline. This saves us the trouble of creating dummy variables in our dataset.
  • Deviance is a measure of goodness of fit of a model, or to be precise, a measure of badness of fit: the lower the deviance, the better the fit.
    • Null deviance tells us how well the response is predicted by a model with nothing but the intercept, i.e. no predictor variables.
    • Residual deviance is much lower than the null deviance. This points out that the model has indeed improved with the addition of predictor variables.
  • AIC: The Akaike Information Criterion (AIC) is useful when we have more than one model and want to compare their goodness of fit. It is computed from the maximized likelihood and adds a penalty for the number of estimated parameters, which discourages overfitting. A model with a lower AIC is preferred over a model with a higher AIC.
  • Number of Fisher Scoring iterations tells us how many iterations the fitting algorithm ran before it converged.
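
The relationship between the residual deviance and the AIC in the summary above can be checked by hand. The sketch below uses only numbers copied from that output; the variable names are ours. It relies on the fact that for a binary (0/1) response the deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated coefficients.

```r
# Values copied from the glm summary above.
residual_deviance <- 105.74
n_coefficients    <- 21     # intercept plus 20 slope terms in the table
n_obs             <- 189    # rows in the training subset

# AIC = -2 * logLik + 2 * k, and here -2 * logLik is the residual deviance.
aic <- residual_deviance + 2 * n_coefficients
aic

# The residual degrees of freedom follow the same count.
n_obs - n_coefficients
```

This reproduces the reported AIC of 147.74 and the 168 residual degrees of freedom, and it shows why dropping weak predictors can lower the AIC even when the deviance rises slightly.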

Training Accuracy

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  100  14
##        Yes   8  67
##                                           
##                Accuracy : 0.8836          
##                  95% CI : (0.8291, 0.9256)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7601          
##                                           
##  Mcnemar's Test P-Value : 0.2864          
##                                           
##             Sensitivity : 0.9259          
##             Specificity : 0.8272          
##          Pos Pred Value : 0.8772          
##          Neg Pred Value : 0.8933          
##              Prevalence : 0.5714          
##          Detection Rate : 0.5291          
##    Detection Prevalence : 0.6032          
##       Balanced Accuracy : 0.8765          
##                                           
##        'Positive' Class : No              
## 
  • Accuracy of our model is defined as the correct classifications divided by the total number of classifications.
  • Kappa is similar to the Accuracy score, but it takes into account the accuracy that would have happened anyway through random predictions. Kappa = (Observed Accuracy - Expected Accuracy) / (1 - Expected Accuracy)
  • Sensitivity = True Positive Rate, TP/(TP + FN): out of all the actual positive values (here the ‘No’ class, which is also the majority class), how many have been predicted correctly.
  • Specificity = True Negative Rate, TN/(TN + FP): out of all the actual negative values (here the ‘Yes’ class, the minority class), how many have been predicted correctly.
  • Prevalence is the proportion of the sample that actually belongs to the positive class. It is calculated as: Prevalence = (TP + FN)/(TP + TN + FP + FN)
  • Detection rate is the proportion of the whole sample that was detected as true positives. Detection Rate = TP/(TP + FP + TN + FN)
  • Balanced Accuracy accounts for class imbalance, as it averages the per-class rates (sensitivity and specificity) instead of pooling all observations. It is thus more informative than plain accuracy when the classes are unequal in size. Balanced Accuracy = (Sensitivity + Specificity)/2.
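
As a worked check, the definitions above can be applied to the training confusion matrix by hand. The four counts are copied from the output; the variable names are ours, and "No" is treated as the positive class, as the output states.

```r
# Training confusion matrix from the output above; "No" is the positive class.
TP <- 100   # predicted No,  actually No
FN <- 8     # predicted Yes, actually No
FP <- 14    # predicted No,  actually Yes
TN <- 67    # predicted Yes, actually Yes
n  <- TP + FN + FP + TN

accuracy          <- (TP + TN) / n
sensitivity       <- TP / (TP + FN)
specificity       <- TN / (TN + FP)
prevalence        <- (TP + FN) / n
detection_rate    <- TP / n
balanced_accuracy <- (sensitivity + specificity) / 2

# Expected accuracy under chance agreement, then Cohen's kappa.
expected   <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, prevalence = prevalence,
        detection_rate = detection_rate,
        balanced_accuracy = balanced_accuracy, kappa = kappa_stat), 4)
```

The rounded values agree with the statistics printed above (for example 0.8836 for accuracy and 0.7601 for kappa).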

Testing Accuracy

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  41  12
##        Yes  1  27
##                                           
##                Accuracy : 0.8395          
##                  95% CI : (0.7412, 0.9117)
##     No Information Rate : 0.5185          
##     P-Value [Acc > NIR] : 1.406e-09       
##                                           
##                   Kappa : 0.6753          
##                                           
##  Mcnemar's Test P-Value : 0.005546        
##                                           
##             Sensitivity : 0.9762          
##             Specificity : 0.6923          
##          Pos Pred Value : 0.7736          
##          Neg Pred Value : 0.9643          
##              Prevalence : 0.5185          
##          Detection Rate : 0.5062          
##    Detection Prevalence : 0.6543          
##       Balanced Accuracy : 0.8342          
##                                           
##        'Positive' Class : No              
## 

2. Using backward Elimination for Feature selection

Machine Learning isn’t all about the algorithms and the mathematics behind it, but it is also equally about the data we’re feeding to the algorithm.

## 
## Call:
## glm(formula = HeartDisease ~ sex + CPType + restBP + cholesterol + 
##     maxHR + oldPeak + nVessels, family = "binomial", data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5244  -0.4379  -0.1376   0.3362   2.3648  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -8.567483   3.204039  -2.674 0.007496 ** 
## sexM                2.005196   0.679469   2.951 0.003166 ** 
## CPTypeAtypical      2.100214   1.084741   1.936 0.052850 .  
## CPTypeNon-Anginal   0.787581   0.964747   0.816 0.414294    
## CPTypeAsymptomatic  3.184125   0.921026   3.457 0.000546 ***
## restBP              0.033723   0.015764   2.139 0.032416 *  
## cholesterol         0.010274   0.005351   1.920 0.054843 .  
## maxHR              -0.030320   0.012863  -2.357 0.018411 *  
## oldPeak             0.857738   0.270593   3.170 0.001525 ** 
## nVessels1           2.485559   0.609459   4.078 4.54e-05 ***
## nVessels2           3.013997   0.814165   3.702 0.000214 ***
## nVessels3           2.565571   0.967780   2.651 0.008026 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 258.14  on 188  degrees of freedom
## Residual deviance: 117.87  on 177  degrees of freedom
## AIC: 141.87
## 
## Number of Fisher Scoring iterations: 6
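
The report does not show which function performed the backward elimination. One common way to do it in base R is `step()` with `direction = "backward"`, which repeatedly drops the term whose removal lowers the AIC the most. The sketch below is an assumption-laden illustration on simulated data (`toy`, `x1`, `x2` and `noise` are invented names), not a reproduction of the model above.

```r
set.seed(7)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
# Only x1 and x2 influence the simulated outcome; `noise` is irrelevant.
toy$y <- rbinom(n, 1, plogis(0.5 + 1.5 * toy$x1 - 1.0 * toy$x2))

full <- glm(y ~ ., family = "binomial", data = toy)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)             # the terms that survived
AIC(reduced) <= AIC(full)    # elimination never increases the AIC
```

Because the criterion is the AIC, the selected model is the one with the best penalized fit on the training data, which does not by itself guarantee better accuracy on the test data.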

Training Accuracy

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  98  16
##        Yes 10  65
##                                          
##                Accuracy : 0.8624         
##                  95% CI : (0.805, 0.9081)
##     No Information Rate : 0.5714         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7165         
##                                          
##  Mcnemar's Test P-Value : 0.3268         
##                                          
##             Sensitivity : 0.9074         
##             Specificity : 0.8025         
##          Pos Pred Value : 0.8596         
##          Neg Pred Value : 0.8667         
##              Prevalence : 0.5714         
##          Detection Rate : 0.5185         
##    Detection Prevalence : 0.6032         
##       Balanced Accuracy : 0.8549         
##                                          
##        'Positive' Class : No             
## 

Testing Accuracy

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  40  16
##        Yes  2  23
##                                           
##                Accuracy : 0.7778          
##                  95% CI : (0.6717, 0.8627)
##     No Information Rate : 0.5185          
##     P-Value [Acc > NIR] : 1.342e-06       
##                                           
##                   Kappa : 0.5492          
##                                           
##  Mcnemar's Test P-Value : 0.002183        
##                                           
##             Sensitivity : 0.9524          
##             Specificity : 0.5897          
##          Pos Pred Value : 0.7143          
##          Neg Pred Value : 0.9200          
##              Prevalence : 0.5185          
##          Detection Rate : 0.4938          
##    Detection Prevalence : 0.6914          
##       Balanced Accuracy : 0.7711          
##                                           
##        'Positive' Class : No              
## 

3. Using significant variables

## 
## Call:
## glm(formula = HeartDisease ~ sex + CPType + restBP + restECG + 
##     ExIndAngina + nVessels + slopePEST, family = "binomial", 
##     data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5841  -0.4434  -0.1211   0.3581   2.5739  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -12.5759     3.0170  -4.168 3.07e-05 ***
## sexM                 2.0408     0.6440   3.169 0.001529 ** 
## CPTypeAtypical       2.2371     1.1306   1.979 0.047841 *  
## CPTypeNon-Anginal    1.0050     1.0378   0.968 0.332869    
## CPTypeAsymptomatic   3.2688     1.0010   3.266 0.001093 ** 
## restBP               0.0405     0.0171   2.369 0.017823 *  
## restECG1             1.4902     2.7896   0.534 0.593221    
## restECG2             0.8434     0.4943   1.706 0.087996 .  
## ExIndAnginaYes       1.1024     0.5304   2.078 0.037666 *  
## nVessels1            2.3649     0.5968   3.963 7.40e-05 ***
## nVessels2            3.5037     0.8961   3.910 9.23e-05 ***
## nVessels3            3.0284     0.9368   3.233 0.001226 ** 
## slopePESTFlat        1.9170     0.5203   3.684 0.000229 ***
## slopePESTDown        1.6295     1.0121   1.610 0.107404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 258.14  on 188  degrees of freedom
## Residual deviance: 120.27  on 175  degrees of freedom
## AIC: 148.27
## 
## Number of Fisher Scoring iterations: 6

Training Accuracy

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  93  16
##        Yes 15  65
##                                           
##                Accuracy : 0.836           
##                  95% CI : (0.7753, 0.8858)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : 7.009e-15       
##                                           
##                   Kappa : 0.6646          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8611          
##             Specificity : 0.8025          
##          Pos Pred Value : 0.8532          
##          Neg Pred Value : 0.8125          
##              Prevalence : 0.5714          
##          Detection Rate : 0.4921          
##    Detection Prevalence : 0.5767          
##       Balanced Accuracy : 0.8318          
##                                           
##        'Positive' Class : No              
## 

Testing Accuracy

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  41  13
##        Yes  1  26
##                                          
##                Accuracy : 0.8272         
##                  95% CI : (0.727, 0.9022)
##     No Information Rate : 0.5185         
##     P-Value [Acc > NIR] : 6.49e-09       
##                                          
##                   Kappa : 0.65           
##                                          
##  Mcnemar's Test P-Value : 0.003283       
##                                          
##             Sensitivity : 0.9762         
##             Specificity : 0.6667         
##          Pos Pred Value : 0.7593         
##          Neg Pred Value : 0.9630         
##              Prevalence : 0.5185         
##          Detection Rate : 0.5062         
##    Detection Prevalence : 0.6667         
##       Balanced Accuracy : 0.8214         
##                                          
##        'Positive' Class : No             
## 

Comparing all Logistic Regression Models

Decision Tree

A decision tree is a tree where each internal node represents a feature (or attribute), each link (or branch) represents a decision based on the value of the respective feature, and each leaf represents an outcome, which is categorical in our case.

The above decision tree asserts that-

The patient doesn’t have heart disease under the following circumstances:

  1. When nVessels is 0 & CPType is Typical or Atypical or Non-Anginal.

  2. When nVessels is 0 & CPType is Asymptomatic & ExIndAngina is No.

  3. When nVessels is 1 or 2 or 3 & CPType is Typical or Non-Anginal & slopePEST is Up or Down.

Whereas the patient has heart disease under the following circumstances:

  1. When nVessels is 0 & CPType is Asymptomatic & ExIndAngina is Yes.

  2. When nVessels is 1 or 2 or 3 & CPType is Typical or Non-Anginal & slopePEST is Flat.

  3. When nVessels is 1 or 2 or 3 & CPType is Atypical or Asymptomatic.
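
The six rules above can be written as a plain function, which makes it easy to see that they cover every combination exactly once. This is a hand-written restatement of the listed rules, not the fitted tree object; the function name `predict_tree` is hypothetical, and the level names are taken from the model output and the text.

```r
# Returns "Yes" (heart disease) or "No" following the rules listed above.
predict_tree <- function(nVessels, CPType, ExIndAngina, slopePEST) {
  if (nVessels == 0) {
    # No colored vessels: only asymptomatic pain with exercise angina is "Yes".
    if (CPType == "Asymptomatic" && ExIndAngina == "Yes") "Yes" else "No"
  } else if (CPType %in% c("Atypical", "Asymptomatic")) {
    "Yes"
  } else if (slopePEST == "Flat") {
    # Typical or Non-Anginal pain with a flat ST slope.
    "Yes"
  } else {
    "No"
  }
}

predict_tree(0, "Asymptomatic", "Yes", "Up")
predict_tree(2, "Non-Anginal", "No", "Down")
```

Note that the rules use only four of the thirteen predictors, which is part of what makes a single tree easy to read.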

Training Accuracy

## Confusion Matrix and Statistics
## 
##      
##       No Yes
##   No  91  11
##   Yes 17  70
##                                           
##                Accuracy : 0.8519          
##                  95% CI : (0.7931, 0.8992)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7003          
##                                           
##  Mcnemar's Test P-Value : 0.3447          
##                                           
##             Sensitivity : 0.8426          
##             Specificity : 0.8642          
##          Pos Pred Value : 0.8922          
##          Neg Pred Value : 0.8046          
##              Prevalence : 0.5714          
##          Detection Rate : 0.4815          
##    Detection Prevalence : 0.5397          
##       Balanced Accuracy : 0.8534          
##                                           
##        'Positive' Class : No              
## 

Testing Accuracy

## Confusion Matrix and Statistics
## 
##      
##       No Yes
##   No  36  12
##   Yes  6  27
##                                           
##                Accuracy : 0.7778          
##                  95% CI : (0.6717, 0.8627)
##     No Information Rate : 0.5185          
##     P-Value [Acc > NIR] : 1.342e-06       
##                                           
##                   Kappa : 0.5525          
##                                           
##  Mcnemar's Test P-Value : 0.2386          
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.6923          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.8182          
##              Prevalence : 0.5185          
##          Detection Rate : 0.4444          
##    Detection Prevalence : 0.5926          
##       Balanced Accuracy : 0.7747          
##                                           
##        'Positive' Class : No              
## 

The accuracy (correct classifications / total classifications) of the decision tree on the test data is about 78%, whereas balanced accuracy is about 77%.

The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). Kappa = (Observed Accuracy - Expected Accuracy) / (1 - Expected Accuracy)

Sensitivity, that is the number of correct positive predictions divided by the total number of positives, is 86%. Specificity, that is the number of correct negative predictions divided by the total number of negatives, is 69%.

Prevalence is the proportion of the sample that belongs to the “positive” class.

Detection rate is the proportion of the whole sample that was correctly detected as positive.

Detection Prevalence tells what percentage of the full sample was predicted as Healthy.

Balanced Accuracy is the average of the rates at which the two classes are predicted correctly. Balanced Accuracy = (Sensitivity + Specificity)/2

Random Forest

## 
## Call:
##  randomForest(formula = HeartDisease ~ sex + CPType + restBP +      restECG + ExIndAngina + nVessels + slopePEST, data = train_set) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.11%
## Confusion matrix:
##     No Yes class.error
## No  92  16   0.1481481
## Yes 22  59   0.2716049

Here the number of trees is 500, and the OOB error is about 20%, so the OOB accuracy is about 80%.

As the forest is built on the training data, each tree is tested on the roughly one third of the samples not used in building that tree. This is the out-of-bag error estimate, an internal error estimate of a random forest as it is being constructed. The confusion matrix also gives the error in predicting each class.
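
The out-of-bag figures in the output above can be reproduced by hand from its confusion matrix. The sketch below uses only the four counts printed there; the variable names are ours. It also computes the chance that a given row is left out of one bootstrap sample, which is where the "roughly one third" comes from.

```r
# Out-of-bag confusion matrix from the randomForest output above
# (rows are the actual classes, columns the predicted classes).
oob <- matrix(c(92, 16,
                22, 59),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("No", "Yes"),
                              predicted = c("No", "Yes")))

oob_error   <- 1 - sum(diag(oob)) / sum(oob)   # overall OOB error rate
class_error <- 1 - diag(oob) / rowSums(oob)    # per-class error

# Probability that a row is never drawn in a bootstrap sample of size n.
n <- sum(oob)
p_out_of_bag <- (1 - 1 / n)^n

round(100 * oob_error, 2)
round(class_error, 4)
round(p_out_of_bag, 3)
```

The per-class errors show that the forest misses patients with heart disease noticeably more often than it misclassifies healthy patients.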

Testing the accuracy of Random Forest

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  41  16
##        Yes  1  23
##                                           
##                Accuracy : 0.7901          
##                  95% CI : (0.6854, 0.8727)
##     No Information Rate : 0.5185          
##     P-Value [Acc > NIR] : 3.951e-07       
##                                           
##                   Kappa : 0.5738          
##                                           
##  Mcnemar's Test P-Value : 0.000685        
##                                           
##             Sensitivity : 0.9762          
##             Specificity : 0.5897          
##          Pos Pred Value : 0.7193          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.5185          
##          Detection Rate : 0.5062          
##    Detection Prevalence : 0.7037          
##       Balanced Accuracy : 0.7830          
##                                           
##        'Positive' Class : No              
## 

Accuracy (correct classifications / total classifications) is 79%, whereas balanced accuracy is 78% for prediction on test data.

Kappa is similar to Accuracy score, but it takes into account the accuracy that would have happened anyway through random predictions.

Kappa = (Observed Accuracy - Expected Accuracy) / (1 - Expected Accuracy)

Sensitivity = True Positive Rate, TP/(TP + FN): out of all the actual positive values (here the ‘No’ class, which is also the majority class), how many have been predicted correctly.

Specificity = True Negative Rate, TN/(TN + FP): out of all the actual negative values (here the ‘Yes’ class, the minority class), how many have been predicted correctly.

Here, sensitivity is about 98% whereas specificity is about 59%.

Prevalence is the proportion of the sample that belongs to the “positive” class.

Detection rate is the proportion of the whole sample that was correctly detected as positive.

Detection Prevalence tells what percentage of the full sample was predicted as Healthy.

Balanced Accuracy is the average of the rates at which the two classes are predicted correctly. Balanced Accuracy = (Sensitivity + Specificity)/2

Cross Validation Method on Random Forest

The k-fold cross validation method involves splitting the dataset into k subsets. Each subset is held out in turn while the model is trained on all other subsets. This process is repeated until a prediction has been made for each instance in the dataset, and an overall accuracy estimate is provided. It is a robust method for estimating accuracy. Here we have taken k=3.

## Random Forest 
## 
## 270 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 180, 180, 180 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8185185  0.6278082
##    7    0.7703704  0.5338010
##   13    0.7481481  0.4910409
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Cross Validation Method on Logistic regression

The k-fold cross validation method involves splitting the dataset into k subsets. Each subset is held out in turn while the model is trained on all other subsets. This process is repeated until a prediction has been made for each instance in the dataset, and an overall accuracy estimate is provided. It is a robust method for estimating accuracy. Here we have taken k=5.

## Generalized Linear Model 
## 
## 270 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 216, 216, 216, 216, 216 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7962963  0.5839538
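
The k-fold procedure described above can be sketched in base R without any helper package. This is an illustration on simulated data, not the code behind the output: `toy`, `x` and `y` are invented, and a one-predictor logistic regression stands in for the real model. With 270 rows and k = 5, each fold holds 54 rows, so every model is trained on 216, matching the sample sizes reported above.

```r
set.seed(99)
n <- 270
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.2 * toy$x))   # simulated binary outcome

k <- 5
# Assign each row a random fold label 1..k, with equal fold sizes.
fold <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]                  # k - 1 folds for fitting
  held  <- toy[fold == i, ]                  # the held-out fold
  fit   <- glm(y ~ x, family = "binomial", data = train)
  p_hat <- predict(fit, newdata = held, type = "response")
  mean((p_hat > 0.5) == (held$y == 1))       # accuracy on the held-out fold
})

table(fold)            # rows per fold
mean(fold_accuracy)    # overall cross-validated accuracy estimate
```

Averaging over folds gives an estimate that depends less on one lucky or unlucky split than a single 70:30 partition does.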

Cross Validation Method on Decision Trees

The k-fold cross validation method involves splitting the dataset into k subsets. Each subset is held out in turn while the model is trained on all other subsets. This process is repeated until a prediction has been made for each instance in the dataset, and an overall accuracy estimate is provided. It is a robust method for estimating accuracy. Here we have taken k=5.

## CART 
## 
## 270 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 216, 216, 216, 216, 216 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.02083333  0.7666667  0.5264599
##   0.03055556  0.7555556  0.5027920
##   0.44166667  0.6518519  0.2567796
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02083333.

Conclusion of the Predictive Analysis